
CU2e77a5x - Add a CDB merge function #373

Merged - 18 commits (Dec 18, 2023)

Conversation

@adam-sutton-1992 (Collaborator) commented Nov 22, 2023

This enhancement is at an early stage; I'll list what I've done, the questions I have, and a TODO list.

What I've done:

  • I've made it a static method on the CDB class so that the parameters are not modified in place - in case the user has further use for them.
  • Copy cdb1 and its config. I consider this CDB to be the base.
  • For all the name2x dicts, I check whether they exist in both CDBs and try to merge them in the most suitable way (e.g. name2count_train adds the numbers of training samples together).
  • If a name key only exists in cdb2, I add that key and value to the new CDB.
  • I merge snames using a set union.
  • I follow a similar process for the cui2x dicts as for names, merging them as appropriate.
  • For context vectors where the CUI exists in both CDBs, I perform a weighted sum, with the weights sourced from the cui2count_train dict. I check each window length accordingly.
  • vocab sums the occurrences of each word from both CDBs.
  • addl_info prioritises cdb1 entries over cdb2 when a key exists in both CDBs (see the sketch after this list).
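
To make these rules concrete, here is a minimal sketch of the per-dictionary merge logic (a sketch only - the helper names are illustrative, not the actual implementation, and context vectors are assumed to be numpy arrays):

import numpy as np

def merge_count_dicts(d1: dict, d2: dict) -> dict:
    # Counts are additive: a key present in both CDBs gets the sum of its counts.
    merged = dict(d1)
    for key, count in d2.items():
        merged[key] = merged.get(key, 0) + count
    return merged

def merge_context_vectors(vec1: np.ndarray, vec2: np.ndarray,
                          count1: int, count2: int) -> np.ndarray:
    # Weighted average, with the weights sourced from cui2count_train.
    return (count1 * vec1 + count2 * vec2) / (count1 + count2)

# snames is a plain set union: merged_snames = cdb1.snames | cdb2.snames
# addl_info keeps cdb1's value whenever a key exists in both CDBs.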

Questions:

  • Is there a better way of checking for the existence of keys in each dictionary than what I've done? I've gone for what I think is most readable, but with the fewest calls to save time - such as only looping through one CDB, not two.
  • Is the weighted average function correct? Are the context vectors always generated from the same embedding space?
  • Are there alternative methods wanted for merging context vectors?
  • What is cui2info used for? There is no documentation for it, and it seems redundant compared to addl_info.
  • How would we handle merging two addl_info with the same keys?
  • What are the ideal implementations for overwrite and vector_import params?
  • What dictionaries are safe to assume are "always populated"?
  • Will all dictionaries that are populated have the same entries?

TODO:

  • Implement the added parameters (overwrite and vector_import)
  • Write tests for all edge cases
  • Handle KeyErrors for when CUIs exist in cui2context_vectors but not in cui2count_train (see the sketch below)
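
A minimal sketch of how that last item might be handled, assuming a fallback to an unweighted mean when no counts are available (the function name and the fallback are illustrative assumptions):

def safe_merge_vector(cdb1, cdb2, cui, vec1, vec2):
    # .get with a default avoids the KeyError when a CUI has context
    # vectors but no entry in cui2count_train.
    count1 = cdb1.cui2count_train.get(cui, 0)
    count2 = cdb2.cui2count_train.get(cui, 0)
    if count1 + count2 == 0:
        return (vec1 + vec2) / 2  # no counts recorded: fall back to a plain mean
    return (count1 * vec1 + count2 * vec2) / (count1 + count2)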

@adam-sutton-1992 adam-sutton-1992 added the enhancement New feature or request label Nov 22, 2023
@adam-sutton-1992 adam-sutton-1992 self-assigned this Nov 22, 2023
@adam-sutton-1992 (Collaborator Author) commented Nov 27, 2023

Thanks for the comments. I've already made some of the suggested changes locally while testing. I'm also looking at using medcat.cdb.add_names instead of manually merging the names dictionaries, as that should be safer and more modular.

I'll upload them shortly.

Regarding where this method should go - I don't see an ideal home for it in medcat.utils currently. CDBMaker could be a good home for it, although that would probably require some cleaning up.

@mart-r (Collaborator) commented Nov 27, 2023

I'll try and answer some of the questions as well.

  • Is there a better way of checking for the existence of keys in each dictionary than what I've done? I've gone for what I think is most readable, but with the fewest calls to save time - such as only looping through one CDB, not two.
    • I can't think of a way to make this much better. In principle, if we wanted to blindly merge every dict, we could pass the pairs of dicts to a separate method that deals with them identically. But the merging isn't identical for all of these. Plus we don't always know which dicts do and don't get populated.
  • Is the weighted average function correct? Are the context vectors always generated from the same embedding space?
    • I think it makes sense. I suppose there could be a different embedding space if different vocabs are used? Though I don't know how common that is.
  • Are there alternative methods wanted for merging context vectors?
    • I don't think so, since the individual training data isn't retained, so you can't go back and re-train on everything (which in theory might be a better option)
  • What is cui2info used for? There is no documentation for it, and it seems redundant compared to addl_info.
    • I'm sure you saw that medcat.utils.helpers uses it for some kind of grouping and ICD10 mappings. But I don't actually know if that's used by anyone. Though it very well might be.
  • How would we handle merging two addl_info with the same keys?
    • That's going to be difficult indeed. There's really not a lot of oversight over what does or does not go there
    • With that said, if someone's got a very complicated setup, they should do their own custom merging
    • But we should probably, at the very least, warn if we encounter something we are unsure how to merge
  • What are the ideal implementations for overwrite and vector_import params?
    • I don't actually know what purpose they're supposed to have
    • Though I could see one wanting to use overwrite to change the first of the two CDBs in place instead of returning a new one
  • What dictionaries are safe to assume are "always populated"?
    • There isn't really much consistency here all in all
    • I ran a simple script over the 11 models I had to see which ones were populated. Take from it what you will
      • 'name2cuis2status': 11
      • 'vocab': 11
      • 'addl_info': 11
      • 'name2cuis': 11
      • 'cui2names': 9
      • 'cui2type_ids': 9
      • 'cui2context_vectors': 9
      • 'cui2count_train': 9
      • 'name2count_train': 8
      • 'name_isupper': 7
      • 'cui2preferred_name': 7
      • 'cui2snames': 6
      • 'cui2tags': 2
      • 'cui2average_confidence': 1
    • I ran the script again on the 19 models I've got, with more details
      • Bear in mind, some of these models probably haven't received any training
      • Which is why they'll have nothing in e.g. cui2count_train
    • Here are the results
      • vocab 1.00 (19 / 19)
      • name2cuis2status 1.00 (19 / 19)
      • addl_info 1.00 (19 / 19)
      • name2cuis 1.00 (19 / 19)
      • name_isupper 0.79 (15 / 19)
      • cui2type_ids 0.79 (15 / 19)
      • snames 0.79 (15 / 19)
      • cui2names 0.79 (15 / 19)
      • cui2context_vectors 0.68 (13 / 19)
      • cui2count_train 0.68 (13 / 19)
      • name2count_train 0.68 (13 / 19)
      • cui2snames 0.63 (12 / 19)
      • cui2preferred_name 0.37 (7 / 19)
      • cui2tags 0.11 (2 / 19)
      • _memory_optimised_parts 0.11 (2 / 19)
      • cui2average_confidence 0.05 (1 / 19)
  • Will all dictionaries that are populated have the same entries?
    • If they're populated, they should be populated for everything. But I don't think there's any guarantee

@adam-sutton-1992 (Collaborator Author) commented Nov 27, 2023

Regarding the context vectors, and how to test that merging them works appropriately:

It is difficult to test whether merging two CDBs results in appropriate context vectors calculated via a weighted average.

To test this we are going to:

  1. Create/source one or more dummy documents. These can be very small / inaccurate.
  2. Train cdb1 and cdb2 on a split of the documents. Train super_cdb on the entirety of the documents.
  3. Merge cdb1 and cdb2 to make merge_cdb, and check whether its context vectors match up with those of super_cdb.
  4. We would expect the context vectors of merge_cdb and super_cdb to be largely the same (i.e. similarity > 0.9xxx, allowing for floating-point differences).

We have considered the case of negative samples affecting the context vectors, as they are random. Given a large enough training corpus this wouldn't be an issue, but that is beyond the scope of this kind of testing, so we will set the number of negative samples to 0. In practice this would not be ideal, but here it is a good workaround for testing.

We would also add documentation noting that context vectors produced by merging will not be identical to those that would have been generated by training on everything at once.
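
A rough sketch of the comparison step, given merge_cdb and super_cdb from the steps above (the 0.9 threshold and the per-window layout of cui2context_vectors follow the plan above; this is not the final test code):

import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

for cui, vectors in merge_cdb.cui2context_vectors.items():
    for window, vec in vectors.items():
        sim = cosine_similarity(vec, super_cdb.cui2context_vectors[cui][window])
        assert sim > 0.9, f"{cui}/{window}: similarity {sim:.4f} is too low"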

@mart-r (Collaborator) commented Dec 12, 2023

Please make the GHA (flake8, mypy, tests) run successfully.
I would recommend running them locally before pushing. Even though you may not cover all Python versions (though you could), you'd probably still cover most things.
I just use the following in my pre-push hook:

flake8 medcat
python -m mypy --follow-imports=normal medcat

(tests take a long time so I don't like them in the hook - I run them separately).

And you could then also update the top description as to what the current state of the PR is - what's been done and what's still a WIP. Perhaps convert the questions and/or TODOs to a task list with checkboxes. Though it would be great to retain anything that fell out of scope for whatever reason.

@adam-sutton-1992 (Collaborator Author):

I've fixed the tests - I had issues with flake8 locally. Flake8 6.0 is compatible with our flake8 config file.

@mart-r (Collaborator) commented Dec 12, 2023

I've fixed the tests - I had issues with flake8 locally. Flake8 6.0 is compatible with our flake8 config file.

See requirements-dev.txt and CONTRIBUTING - we've set things up with v4.0.1.
Though there may well be reasons to upgrade. You can make a CU task if you think there are.

I'll take a look at the entire thing later today.

@adam-sutton-1992 (Collaborator Author):

Just an issue on my local machine. I'm writing up a list now of the changes I've made and whether there's anything else left to do.

@mart-r (Collaborator) commented Dec 12, 2023

Just an issue on my local machine.

Just saying it's easier if you keep your virtual environment in sync with the project requirements :)

@mart-r (Collaborator) left a review comment:

Overall I think this looks good.
But I think we can improve a few small things.

@@ -90,5 +90,44 @@ def test_cui2snames_population(self):
            with self.subTest(cui):
                self.assertIn(cui, self.undertest.cui2snames)

    def test_merge_cdb(self):
Collaborator:

This one test seems to test a lot of different things, and has a lot of setup in the actual test method.

What I would suggest is a new class (unittest.TestCase) where, in the class setup method, the two CDBs are generated and merged, and then having various small test methods where each of the specific things is tested for.

Though in principle, these different test cases (i.e. the different changes that you're testing for within the merge) might also benefit from being separated into different classes. That way we'd be doing (and testing for) one change at a time. But I think it's fine to do it all at once here. Tests already take a long time; I would hate to slow them down even further. We can revisit at a later date if/when this becomes an issue.

Collaborator Author:

I was personally unhappy with the test method being where it is, and with it not being very elegant.

I think if we move the method to its own utils module elsewhere in medcat, then splitting these out would be ideal.

Collaborator:

Unless you see a reason to use this outside of tests, I don't think it fits in the medcat package itself.
Though there is a tests.helper module you could move some boilerplate to.
On the other hand, if this piece of code is only relevant to testing CDB merging, creating a new module for testing the merge capabilities and having the helper method there would probably make more sense.

maker2 = CDBMaker(config) # second maker is required as it will otherwise point to same object
path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "model_creator", "umls_sample.csv")
cdb1 = maker1.prepare_csvs(csv_paths=[path])
cdb2 = maker2.prepare_csvs(csv_paths=[path])
Collaborator:

Perhaps we can get away with 1 CDBMaker just making two identical CDBs? That seems to be the way this class works anyway (the CDBMaker is created once upon class setup and a separate CDB is made in the setup for each individual test).

Collaborator Author:

That is how I assumed it would work.

Making two CDBs from the same maker seemed to point to the same CDB, i.e.:

maker = CDBMaker(config)
path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "model_creator", "umls_sample.csv")
cdb1 = maker.prepare_csvs(csv_paths=[path])
cdb2 = maker.prepare_csvs(csv_paths=[path])
print(cdb1 == cdb2)

prints True.

I'm happy to change this if I'm mistaken or have overlooked something in how the CDBs are built. The same config seems to be fine for CDBMaker.

Collaborator:

I think you're right - it just modifies the same instance each time the prepare_csvs method is called.
Though merging two identical CDBs to return a third identical one could also be an edge case to test (sketched below).

But what this means is that the way the entire class works is prone to side effects since in each setUp it looks like a new CDB is generated, whereas in reality the same instance is modified and returned.
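
That identical-merge edge case might look something like the following sketch (it assumes merge_cdb lives on CDB, as it does at this point in the PR; the class name and assertion are illustrative):

import os
import unittest

from medcat.cdb import CDB
from medcat.cdb_maker import CDBMaker
from medcat.config import Config

class IdenticalMergeTests(unittest.TestCase):

    def test_merge_identical_cdbs(self):
        maker = CDBMaker(Config())
        path = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                            "model_creator", "umls_sample.csv")
        cdb1 = maker.prepare_csvs(csv_paths=[path])
        cdb2 = maker.prepare_csvs(csv_paths=[path])  # same instance as cdb1
        merged = CDB.merge_cdb(cdb1=cdb1, cdb2=cdb2)
        # Merging a CDB with itself should neither add nor drop concepts.
        self.assertEqual(set(merged.cui2names), set(cdb1.cui2names))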

@adam-sutton-1992 (Collaborator Author):

Changes made to medcat.cdb:

  1. Removed a lot of the manual in-place editing of the cui/name dictionaries; instead I've used the medcat.cdb method add_concept. This method has additional checks and can gracefully handle some of the dictionaries stored in medcat.cdb.addl_info.
  2. Context vectors now account for prioritising the training from either CDB when overwrite_training is provided, indicating which CDB takes priority. This is useful when merging CDBs where one comes from a more specialised training process (e.g. cancer CUIs should be better represented by a CDB that has had supervised training in that domain). cui2count_train is also handled according to the overwrite parameter.
  3. If this parameter isn't provided, a weighted average is used to generate the new context vectors (see the sketch after this list).
  4. The snames and vocab dicts are handled the same as before.
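
A minimal sketch of the vector-selection logic in points 2-3 (the function name is illustrative; the 1/2 encoding for overwrite_training follows the tests discussed later in this thread):

def pick_context_vector(vec1, vec2, count1, count2, overwrite_training=None):
    # overwrite_training marks one CDB's training as authoritative.
    if overwrite_training == 1:
        return vec1  # cdb1's trained vector wins outright
    if overwrite_training == 2:
        return vec2  # cdb2's trained vector wins outright
    # Default: weighted average using the per-CUI training counts
    # (vec1/vec2 are assumed to be numpy arrays).
    return (count1 * vec1 + count2 * vec2) / (count1 + count2)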

Added testing to tests.test_cdb.py:

  1. I generate two CDBs using umls_sample.csv.
  2. I then introduce some changes to each CDB: new CUIs, new context vectors, additional training counts.
  3. I merge the cdbs
  4. I test new cuis being in the merged cdb as intended, along with testing the weighted average functionality.

These tests might not be exhaustive; if there are any tests you'd like to see, let me know and I'll make sure to add them.

Regarding a cdb generated from a full dataset (full_cdb) vs a merged cdb (merge_cdb) from two cdbs split on the same dataset:

We agreed that this test isn't suitable for unit testing. It proved quite difficult to collect suitable results on very small training datasets. I tested with and without negative sampling and found similar results either way. Due to the small size of the training document, the vectors generated in the two CDBs were orthogonal to each other, so I added some overlap between the documents. This increased the similarity between the CDBs, and when comparing a fully trained CDB to a merged one, the performance was as expected.

For a fairer comparison, I tested the vector merging method from medcat.cdb.import_training. Where we use a weighted average, import_training takes a plain mean of the two context vectors. The similarities between merge_cdb and full_cdb were very close in both cases, with the weighted average being slightly higher.

Potential changes to be done:

  1. Where does this go? At the moment it's housed in medcat.cdb. We could also find a more suitable place for it in medcat.cdb_maker or medcat.utils. I think cdb_maker is a fair place, as it makes a cdb and could then use the config that is passed to the CDBMaker for the new, merged CDB.
  2. overwrite_training essentially changes the weighting to 0 and 1 for each CDB. It could also be done by passing a tuple that would allow any weighting. I'm unsure of any benefit to arbitrarily choosing the weighting when generating context vectors.
  3. More unit testing. There may be more edge cases that are worth testing.

@mart-r (Collaborator) commented Dec 13, 2023

  • Where does this go? At the moment it's housed in medcat.cdb. We could also find a more suitable place for it in medcat.cdb_maker or medcat.utils. I think cdb_maker is a fair place, as it makes a cdb and could then use the config that is passed to the CDBMaker for the new, merged CDB.

The only reason I'd be inclined to not put the method in cdb_maker is the fact that generally this module isn't used anywhere other than when generating a new CDB.
So if we truly wanted it not to be in this module, I'd be inclined to have it in its own module in the medcat.utils package.
Though in either case, if we move it out we lose the discoverability of having it attached to the CDB.
(We could have a method in the CDB that delegates to the actual method somewhere else, but we'd end up duplicating the docstring, which I'm not really a fan of.)

  1. overwrite_training essentially changes the weighting to 0 and 1 for each CDB. It could also be done by passing a tuple that would allow any weighting. I'm unsure of any benefit to arbitrarily choosing the weighting when generating context vectors.
  2. More unit testing. There may be more edge cases that are worth testing.

I think it might be worth trying to set up tests for the different overwrite_training values.
And technically, we'd probably want a separate test for full_build=True as well, but I don't know if you've got a good use case to demonstrate its effectiveness.

@tomolopolis (Member) left a review comment:

lgtm -

Couple of things though.

  • Can we move this to a medcat.utils.cdb module?
    • The rationale for moving it into this new module is that there is a CDB filter method somewhere that should also go there.
  • A tutorial on how to use this functionality would be nice. Perhaps in a specialised tutorial called something like "CDB management"? https://github.com/CogStack/MedCATtutorials
  • Responding to the thread re: tests - side effects in tests are a recipe for future frustration. Please fix with setup local to the class in the test, etc.

@mart-r (Collaborator) commented Dec 13, 2023

  • Responding to the thread re: tests - side effects in tests are a recipe for future frustration. Please fix with setup local to the class in the test, etc.

I agree. But I think it's outside the scope of this PR since it's the pre-existing CDBTests setup that's introducing that.
I'll create a CU task. And do a quick PR.

@adam-sutton-1992 (Collaborator Author):

A few more changes which I think finalise this:

  1. Set up different tests for overwrite_training=True and full_build.
  2. Moved the set up of CDBs into tests.helper as suggested
  3. Bug fixes due to added testing (thanks)
  4. Removed redundant checks for certain cases

Before I merge the request, I just want to check whether there are any further changes I've missed from the previous suggestions.

Regarding the move of this work to another module, should that be another CU also?

@mart-r (Collaborator) commented Dec 14, 2023

A few more changes which I think finalise this:

  1. Set up different tests for overwrite_training=True and full_build.
  2. Moved the set up of CDBs into tests.helper as suggested
  3. Bug fixes due to added testing (thanks)
  4. Removed redundant checks for certain cases

That all looks great!

Before I merge the request, I just want to check whether there are any further changes I've missed from the previous suggestions.

I think the tests could still benefit from being separated into multiple tests, especially since you added more tests to the same test method.
I.e. something along these lines:

class CDBMergeTests(unittest.TestCase):

    @classmethod
    def setUpClass(cls) -> None:
        to_merge = ForCDBMerging()
        cls.cdb1 = to_merge.cdb1
        cls.cdb2 = to_merge.cdb2
        cls.merged_cdb = CDB.merge_cdb(cdb1=cls.cdb1, cdb2=cls.cdb2)
        cls.merged_overwrite_cdb = CDB.merge_cdb(cdb1=cls.cdb1, cdb2=cls.cdb2,
                                                 overwrite_training=2, full_build=True)

    # tests for different things - properly named and (most likely) one per assert

By separating the tests into multiple properly named ones, we'll have more granularity about what breaks (if it does) in the future, instead of just "test_cdb_merge" failed, it'd say something like "test_cdb_merge_name_in_cui2names" failed.
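
For instance, one of those granular tests (the method body is an illustrative guess, not code from the PR) could look like:

    def test_cdb_merge_name_in_cui2names(self):
        # Every name from the second CDB's concepts should survive the merge.
        for cui, names in self.cdb2.cui2names.items():
            for name in names:
                self.assertIn(name, self.merged_cdb.cui2names[cui])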

Regarding the move of this work to another module, should that be another CU also?

If Tom said he'd like it in medcat.utils.cdb, then that's what it'll be :)
So I'd say do the move here. Though don't bother looking for the other thing he said should go there - that's definitely outside the scope of this PR.

@adam-sutton-1992 (Collaborator Author) commented Dec 14, 2023

I've moved everything over, and split out the tests.

I've named the medcat.utils.cdb module medcat.utils.cdb_utils, keeping form with similar modules. I've also created the testing.utils.test_cdb_utils module.

I'll merge when these changes get a thumbs up.

@mart-r (Collaborator) commented Dec 15, 2023

Once the two comments I've made are resolved, I'd be happy for a merge.

@mart-r (Collaborator) commented Dec 15, 2023

Looks all good to me now.

Wait for GHA to do its magic and we'll be good to merge on my end.

@adam-sutton-1992 adam-sutton-1992 merged commit 70305f4 into master Dec 18, 2023
5 checks passed
@mart-r (Collaborator) commented Jan 2, 2024

@adam-sutton-1992
For future reference, let's try to do a "Squash and merge" rather than a regular merge (there's an option under the arrow for a PR, and it'll remember your preference). That way we avoid a bunch of micro-commits on the master branch that are only really relevant in the context of a specific PR rather than the entire project.

@adam-sutton-1992 (Collaborator Author):

@mart-r

Hi, sorry - I will do so in the future.

Thanks.
